1. Motivations for the Book

Building trustworthy systems is one of the main challenges facing software
developers. Dependability-related concerns have accompanied system
developers since the first day such systems were built and deployed. Many
things have obviously changed since then, including the nature of faults and
failures, the complexity of the systems, the services they deliver and the
way our society uses them. But the need to deal with various threats (such
as failed components, deteriorating environments, component mismatches,
human mistakes, intrusions and software bugs) is still at the core of
software and system research and development. As computers penetrate new
domains (including critical ones) and the complexity of modern systems
grows, achieving dependability remains central for system developers and
users.

Accepting that errors always happen, in spite of all efforts to eliminate
the faults that might cause them, is at the core of dependability. To this
end, various fault tolerance mechanisms have been investigated by
researchers and used in industry. Unfortunately, more often than not these
solutions focus exclusively on the implementation (e.g. they are provided as
middleware/OS services or libraries), ignoring other development phases,
most importantly the earlier ones. This creates a dangerous gap between the
requirement to build dependable (and fault tolerant) systems and the attempt
to meet it exclusively by using specific fault tolerance mechanisms at the
implementation step. 1 One consequence is a growing number of reported
situations in which fault tolerance means undermine the overall system
dependability because they are not used properly.

We believe that fault tolerance needs to be explicitly included in
traditional software engineering theories and practices, and that it should
become a part of all steps of software development. As current software
engineering practices tend to capture only normal behaviour, assuming that
all faults can be removed during development, new software engineering
methods and tools need to be developed to support explicit handling of
abnormal situations. Moreover, every phase in the software development
process needs to be enriched with phase-specific fault tolerance means.
Generally speaking, integrating fault tolerance into software engineering
requires:

• integrating fault tolerance means into system models starting from
the early development phases (i.e. requirements and architecture);

• making fault tolerance-related decisions for each appropriate model 
at each phase by explicit modelling of faults, fault tolerance means 
and dedicated redundant resources (with a specific focus on fault 
tolerant software architectures); 

• ensuring correct transformations of the models used at various de- 
velopment phases with a specific focus on transformation of the 
fault tolerance means; 

• supporting verification and validation of the fault tolerance means; 

• developing dedicated tools for fault tolerance modelling; 

• providing domain-specific application-level fault tolerance mecha- 
nisms and abstractions. 

This book consists of chapters describing novel approaches to integrating
fault tolerance into the software development process. They cover a wide
range of topics, focusing on fault tolerance during the different phases of
software development, software engineering techniques for verification and
validation of fault tolerance means, and languages for supporting fault
tolerance specification and implementation. Accordingly, the book is
structured into the following three parts:

• Part A: Fault tolerance engineering: from requirements to code; 

• Part B: Verification and validation of fault tolerant systems; 

• Part C: Languages and Tools for engineering fault tolerant systems. 

The next section of this chapter briefly introduces the main dependability
and fault tolerance concepts. Section 3 defines the software engineering
realm. Sections 4, 5 and 6 introduce the three areas corresponding to the
book parts and briefly outline the current state of research. The last
section summarises the content of the book.

2. Dependability and Fault Tolerance 

Dependability is usually defined as the ability of a system to deliver
service that can be justifiably trusted. 2 Ensuring the required
dependability level for complex computer-based systems is a challenge facing
many researchers and developers working in various relevant domains. The
difficulties come from various sources, including the cost of making a
system dependable, the growing complexity of modern applications, their
pervasiveness and openness, the proliferation of computer-based systems into
new emerging domains, the wider reliance our society places on these
systems, the complexity of assessing the impact which various dependability
means have on the resulting system dependability, difficulties in defining
realistic and practical assumptions under which these means are to be
applied, difficulties in setting dependability requirements and tracing them
through all development phases, etc.

Dependability is an integrated concept encompassing a variety of
attributes, including availability, reliability, safety, integrity, and
maintainability. Four general means can be employed to attain
dependability: 2 fault prevention, fault tolerance, fault removal, and fault
forecasting. Clearly, in practice one needs to apply a combination of all of
these means to ensure the required dependability. It is important to
understand that all these activities are centred around the concept of
faults: where possible, faults are prevented or eliminated by using
appropriate development and verification techniques, while the remaining
faults are tolerated at runtime to avoid system failures and are estimated
to help in predicting their consequences.

In this chapter, we follow the dependability terminology from 2, which
introduces the following causal chain of dependability threats. A system
failure to deliver its service is said to be caused by an erroneous system
state, which, in turn, is caused by a triggered fault. This means that
faults can be silent for some time and that their triggering does not
necessarily cause an immediate failure. Errors are typically latent, and the
aim of fault tolerance is to detect them and deal with them and their causes
before they make systems fail.

This book focuses on fault tolerance means that are used to avoid system
failures in the presence of faults. The essence of fault tolerance 3 is in
detecting errors and carrying out the subsequent system recovery. Generally
speaking, system recovery consists of two steps: error handling and fault
handling.

Error handling can be conducted in one of the following three ways:
backward error recovery (sometimes called rollback), forward error recovery
(sometimes called rollforward) or compensation. Backward error recovery
returns the system to a previous (assumed to be correct) state. The typical
techniques are checkpoints, recovery points, recovery blocks,
conversations, file backup, application restart, system reboot, transaction
abort, etc. Forward error recovery moves the system into a new correct
state; this type of recovery is typically carried out by employing exception
handling techniques (found, for example, in many programming languages, such
as Ada, Java, C++, etc.). Note that backward error recovery is usually
interpreted as a particular case of forward error recovery. There has been a
considerable amount of work on defining exception handling mechanisms
suitable for different domains, development and modelling paradigms, types
of faults, execution environments, etc. (see, for example, the recent
book 4 ). It is worth noting here that, generally speaking, the rollforward
means are more general and run-time efficient than the rollback ones as they
take advantage of precise knowledge of the erroneous system state and move
the system into a correct state by using application-specific handlers. To
conduct compensation one needs to ensure that the system contains enough
redundancy to mask errors without interrupting the delivery of its service.
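
To make the contrast between backward and forward error recovery more
concrete, the following minimal sketch applies both to the same toy
component. It is an illustration only: the Account class, its invariant and
the two recovery policies are hypothetical names introduced here, not taken
from the cited work.

```python
import copy


class Account:
    """Toy component whose state can become erroneous."""

    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        self.balance -= amount
        # Error detection: the new state violates the invariant balance >= 0.
        if self.balance < 0:
            raise ValueError("negative balance")


def withdraw_with_rollback(account, amount):
    """Backward recovery: restore a checkpoint taken before the operation."""
    checkpoint = copy.deepcopy(vars(account))
    try:
        account.withdraw(amount)
        return True
    except ValueError:
        vars(account).update(checkpoint)  # return to the saved state
        return False


def withdraw_with_handler(account, amount):
    """Forward recovery: an application-specific handler builds a new state."""
    try:
        account.withdraw(amount)
        return amount
    except ValueError:
        # The handler uses knowledge of the erroneous state:
        # the negative balance is exactly the shortfall.
        shortfall = -account.balance
        account.balance = 0          # a new correct state, not the old one
        return amount - shortfall    # amount that could actually be granted
```

The rollback variant needs extra memory for the checkpoint and discards the
work done so far, whereas the handler needs no saved state but has to be
written with the application semantics in mind.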

Various replication and software diversity techniques fall into this cate- 
gory as they mask the erroneous results without having to move the system 
into a state which is assumed to be correct. A wide range of software di- 
versity mechanisms, including recovery blocks, conversations and N-version 
programming, has been developed and widely used in industry. 
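
As a rough illustration of how software diversity masks errors, the sketch
below runs three independently written versions of the same function and
adopts the majority result. The three versions, the seeded fault in
version_c and the voter are hypothetical and deliberately simplistic; real
N-version systems have to deal with issues such as inexact voting and
correlated faults, which are ignored here.

```python
from collections import Counter


def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    # Deliberately faulty version (a seeded design fault, triggered by x == 3).
    return 10 if x == 3 else x * x


def n_version_execute(versions, x):
    """Run all versions and return the majority result, masking minority errors."""
    results = [version(x) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority: the error cannot be masked")
    return value
```

With the three versions above, the call n_version_execute([version_a,
version_b, version_c], 3) delivers the correct result even though one
version fails, and no state has to be restored.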

Fault handling is very different in nature from error handling, as it
aims to rid the system of faults in order to avoid the new errors they might
cause in later execution. It starts with fault diagnosis, followed by
isolation of the faulty component and system reconfiguration. After that



November 9, 2010 3:29 



WSPC - Proceedings Trim Size: 9in x 6in SEFT-HM-070213 




the system or its part needs to be re-initialized to continue to provide its 
service. Fault handling is usually much more expensive than error handling 
and is more difficult to apply as it typically requires some part of the system 
to be inactive to conduct reconfiguration. 

Fault tolerance never comes for free as it always requires additional
(redundant) resources which are employed at runtime for conducting detection
and recovery. Specific fault tolerance mechanisms require various types of
redundancy such as spare time, additional memory or disk space, extra ex- 
change channels, additional code or messages, etc. Typically each scheme 
uses a combination of redundant resources, for example, simple retry always 
uses time redundancy, but may need extra disk space and code to save the 
checkpoints if we need to restore the system state before retrying. 
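
The retry example can be sketched as follows. This is a minimal sketch under
stated assumptions: the state is a plain dictionary, the helper name
retry_with_checkpoint and its max_attempts parameter are hypothetical, and
every exception is treated as a detected error.

```python
import copy


def retry_with_checkpoint(state, operation, max_attempts=3):
    """Retry `operation` on `state`, restoring a checkpoint after each failure.

    Time redundancy: up to `max_attempts` executions of the operation.
    Space redundancy: one saved copy of the state (the checkpoint).
    """
    checkpoint = copy.deepcopy(state)
    for attempt in range(1, max_attempts + 1):
        try:
            operation(state)
            return attempt  # number of attempts actually used
        except Exception:
            # Undo any partial updates made by the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("operation failed after %d attempts" % max_attempts)
```

If the operation never modifies the state before failing, the checkpoint
(and the memory it occupies) can be dropped and only the time redundancy
remains.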

The choice of the specific error detection, error handling and fault han- 
dling techniques to be used for a particular system is directly related to and 
depends upon the underlying fault assumptions. For example, replication 
techniques are typically used to tolerate hardware faults, whereas software 
diversity is employed to deal with software design bugs. 

Let us now briefly discuss the main challenges in developing fault
tolerant systems. 5 First of all, fault tolerance means are difficult to
develop and to use, because they increase system complexity by adding a new
dimension to the reasoning about system behaviour. Their application
requires a deep understanding of the intricate links between normal and
abnormal behaviour and states of systems and components, as well as the
state and behaviour during recovery. Secondly, fault tolerance (e.g.
software diversity, rollback, exception handling) is costly as it always
uses redundancy. Thirdly, system designers are typically reluctant to think
about faults at the early phases of development. As a result, early
decisions are made without regard to fault tolerance, which may make it more
difficult or expensive to introduce fault tolerance at the later phases.
More often than not, developers fail to apply even the basic principles of
software fault tolerance. For example, there is no focus on (i) a clear
definition of the fault assumptions as the central step in designing any
fault tolerant system, (ii) developing means for early error detection,
(iii) application of recursive system structuring for error confinement,
(iv) minimising and ensuring error confinement and error recovery areas, and
(v) extending component specification with a concise or complete definition
of failure modes. We can refer here to a recent paper reporting a high
number of mistakes made in handling exceptions in C programs 6 and to the
Interim Report on Causes of the August 14th 2003 Blackout in the US and
Canada, 7 which clearly shows that the problem was mostly caused by badly
designed fault tolerance: poor diagnostics of faults, longer-than-estimated
time for component recovery, failure to involve all necessary components in
recovery, inconsistent system state after recovery, failures of alarm
systems, etc. It is worth recalling here, as well, that the failure of the
Ariane 5 launcher was caused by improper handling of an exception. 8

All the factors above contribute to the fact that a substantial part of
system failures is caused by mistakes in fault tolerance means. 1 We believe
that a closer synergy between software engineering phases, methods and
tools and fault tolerance will help to alleviate these problems.

3. Defining Software Engineering 

Software engineering (SE) is a fairly young field of Computer Science,
recognized at the 1968 NATO conference in Garmisch (Germany) as an emerging
discipline. Since then, many different definitions of software engineering
have been proposed, each trying to explain its main characteristics:

• "Application of systematic, disciplined, quantifiable approach to the
development, operation, and maintenance of software"; 9

• "Software engineers should adopt a systematic and organised ap- 
proach to their work and use appropriate tools and techniques de- 
pending on the problem to be solved, the development constraints 
and the resources available"; 10 

• "Software Engineering is the field of computer science that deals 
with the building of software systems that are so large or so complex 
that they are built by a team or teams of engineers"; 11 

• "Software engineering is the branch of systems engineering concerned
with the development of large and complex software intensive systems". ...
"It is also concerned with the processes, methods and tools for the
development of software intensive systems in an economic and timely
manner". 12

Most of these well-known definitions point out different characteristics
or perspectives from which to look at the software engineering discipline.
Some examples, drawn from the experience of computer scientists over the
last forty years, will help to identify the key points common to most of the
definitions of SE, and to illustrate why and when the software engineering
discipline is needed. Two groups of examples will be used for this purpose:
(i) big software failures, such as the Ariane 5, the Therac-25 radiation
therapy machine, the Denver Airport and others, 13 and (ii) the excellence
of the on-board shuttle group. 14

3.1. If Software Fails, It May Cost Millions of Dollars and 
Injure People 

As already pointed out in Section 1, software is pervasive (it is
everywhere around us, even if we do not see it); it controls many devices
used every day, and more and more critical systems (i.e., those systems
whose malfunctioning can injure people or cause high economic losses).

The Ariane 5, the Therac-25 and the Denver Airport are examples of
critical systems which, due to malfunctioning software, ended up as
catastrophic failures a. The Ariane 5 launcher, launched on June 4th 1996,
broke up and exploded about forty seconds after initiation of the flight
sequence, due to a software problem. The flight was unmanned, but the
launcher and its payload were lost. The Therac-25 radiation-treatment
machine for cancer therapy injured and even killed several patients by
administering radiation overdoses. The Denver Airport software, responsible
for controlling 35 kilometers of rails and 4000 telewagons, never worked
properly and after 10 years of recurring failures it was recently
dismissed: 15 millions of dollars were wasted. In all three cases, the main
causes of failure were undisciplined management of requirements, imprecise
and ambiguous communication, unstable architectures, high complexity,
incoherence among requirements, design and implementation, low automation,
and insufficient verification and validation.

3.2. How to Make Good Software 

While the previous examples describe what might happen when software
engineering techniques are not taken into consideration, the on-board
shuttle group example of excellence (taken from the 1996 white paper written
by Fishman) shows the results that can be achieved when software engineering
best practices are applied. It describes how the software governing a
120-ton space shuttle is conceived: this software system is composed of
around 500,000 lines of code, it controls 4 billion dollars' worth of
equipment, and decides the lives of a half-dozen astronauts. What makes this
software and its creators so extraordinary is that it never crashes and is
bug free (according to 14 ). The last three versions of the program (each
one of 420,000 lines of code) had just one fault each. The last 11 versions
of this software system had a total of 17 faults. Commercial programs of
equivalent complexity would have 5,000 faults. How did the developers
achieve such high software quality?

a Those and many other examples of catastrophic failures are described in .

It was simply the result of applying most of the Software Engineering 
(SE) best practices: 



• SE allows for a disciplined, systematic, and quantifiable development:
The on-board shuttle group is the antithesis of the up-all-night,
pizza-and-roller-hockey software coders who have captured the public
imagination. To be this good, the on-board shuttle group is very ordinary:
an indistinguishable, focused, disciplined, and methodically managed
creative enterprise;

• SE does not only concern programming: Another important factor
discussed in Fishman's report 14 is that about one-third of the process of
writing software happens before anyone writes a line of code. Every
critical requirement is documented. Nothing in the specification is changed
without agreement and understanding. No coder changes a single line of code
without carefully outlining the change;

• SE takes into consideration maintenance and evolution: As explicitly
stated in 9, maintenance and evolution are important factors when
engineering software systems. They allow the system to evolve while
limiting newly introduced faults;

• SE is for mid to large systems: Applying SE practices is indeed
expensive and requires effort. While non-critical, small systems may
require just a few SE principles, applying the best SE practices to the
development of critical, large software systems is mandatory;

• Development cost and time are key issues: The main success of this
example is not the software but the software process the team uses.
Recently, much effort has been spent on identifying new software processes
(like the Unified Software Process 16 ), and software maturity frameworks
which allow the improvement of the software development process (like the
Capability Maturity Model - CMM 17 or the Personal Software Process -
PSP 18 ). Nowadays software processes explicitly take into consideration
tasks like managing groups, setting deadlines, checking the system cost to
stay on budget, and delivering software which has the expected qualities.

4. Fault Tolerance Engineering: from Requirements to
Code

Initial solutions relegated fault tolerance (and specifically, exception
handling) to the late design and implementation phases of the software
life-cycle. More recently, the need for explicit exception handling
mechanisms throughout the entire life cycle has been advocated by some
researchers as one of the main approaches to ensuring the overall system
dependability. 19,20

In particular, it has been recognized that different classes of faults, errors 
and failures can be identified during different phases of software develop- 
ment. A number of studies have been conducted so far aiming to understand 
where and how fault tolerance can be integrated in the software life-cycle. 

In the remaining part of Section 4 we show how fault tolerance has
recently been addressed at the different phases of the software process. The
phases taken into consideration are requirements, high-level
(architectural) design, and low-level design, reflecting current research on
fault tolerance techniques in these phases.

4.1. Requirements Engineering and Fault Tolerance 

Requirements Engineering is concerned with identifying the purpose of a 
software system, and the contexts in which it will be used. Different theo- 
ries and methodologies for finding out, modelling, analyzing, evolving and 
checking software system requirements 21 have been proposed so far. 

Since requirements are the first artefacts produced during the software
process, it is important to document expected faults and ways to tolerate
them at this stage. Some approaches have been proposed for this purpose, the
best known being those in 20,22,23 and subsequent work.

In 22,24 the authors describe a process for systematically investigating
exceptional situations at the requirements level and provide an extension
to standard UML use case diagrams in order to specify exceptional
behaviour. The work in 20 describes how exceptional behaviours can be
specified at the requirements level, and how those requirements can drive
component-based specification and design according to the Catalysis
process. In 23 an approach for analyzing the safety and reliability of
requirements based on use cases is proposed: normal use cases are extended
with exceptional use cases according to 22, then use cases are annotated
with their probability of success and subsequently translated into
Dependability Assessment Charts, which are eventually used for
dependability analysis.

4.2. Software Architecture and Fault Tolerance

Software Architecture (SA) has been widely accepted as a well-suited
concept for achieving better software quality while reducing the time and
cost of production. In particular, a software architecture specification 25
represents the first complete system description in the development
life-cycle. It provides both a high-level behavioural abstraction of
components and of their interactions (connectors) and a description of the
static structure of the system.

Typical SA specifications model only normal behaviour of the system, 
while ignoring abnormal ones. As a consequence, the system may fail in 
unexpected ways due to some faults. In the context of critical systems 
with fault tolerance requirements it becomes necessary to introduce fault 
tolerance information at the software architecture level. In fact, the error 
recovery effectiveness is dramatically reduced when fault tolerance is com- 
missioned late in the software life-cycle. 19 

Many approaches have been proposed for modelling and analyzing fault
tolerant software architectures. While a comprehensive survey on this topic
is given in 26, this introductory section simply identifies the main topics
covered by the existing approaches and provides some references to the
existing work.

• Fault Tolerant SA specification: as discussed in many papers
(e.g., 27-29 ) a software architecture can be specified using box-and-line
notations, formal architecture description languages (ADLs) or UML-based
notations. As far as the specification of fault tolerant software
architectures is concerned, both formal and UML-based notations have been
used. The approaches proposed in 30-32 are examples of formal
specifications for Fault Tolerant SA: traditional architecture description
languages are usually extended in order to explicitly specify error and
fault handling. The approaches in, e.g., 20,33,34 use UML-based notations
for modelling Fault Tolerant SA: new UML profiles are created in order to
specify fault tolerance concepts;

• Fault Tolerant SA analysis: analysis techniques (such as deadlock
detection, testing, checking, simulation, performance) allow software
engineers to assess the software architecture and to evaluate its quality
with respect to expected requirements. Some approaches have been proposed
for analyzing Fault Tolerant SA: most of them check the conformance of the
architectural model to fault tolerance requirements or constraints (as
in 35,36 ). A testing technique for Fault Tolerant SA is presented in 34;

• Fault Tolerant SA styles: according to 37, an architectural style is
"a set of design rules that identify the kinds of components and
connectors that may be used to compose a system or subsystem, together with
local or global constraints on the way the composition is done". Many
architectural styles have been proposed for Fault Tolerant SA: the
idealized fault tolerant style (in 34 ), the iC2C style (which integrates
the C2 architectural style with the idealized fault tolerant component
style 30 ), and the idealized fault tolerant component/connector style; 38

• Fault Tolerant SA middleware support: when implementing a software
architecture via component-based systems, middleware technology can be
used to implement connectors, coordination policies and many other
features. In 39 a CORBA implementation of architectural exception handling
is proposed. In 40 the authors propose an approach for exception handling in
component composition at the architectural level with the support of
middleware. Many projects have been conducted to provide fault tolerance
to CORBA applications, like AQuA, Eternal, IRL, and OGS (see 41 ). More
details about middleware are given in Section 6.
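
The idealized fault tolerant component mentioned among the styles above is
commonly described as separating normal activity from abnormal (exception
handling) activity, and as signalling interface and failure exceptions to
its callers. The following is a rough sketch of that general idea only; the
class and method names are ours and do not reproduce the notation of any of
the cited approaches.

```python
class InterfaceException(Exception):
    """Signalled when a service request violates the component's interface."""


class FailureException(Exception):
    """Signalled when the component cannot deliver the requested service."""


class IdealizedComponent:
    """Keeps normal activity apart from abnormal (recovery) activity."""

    def __init__(self, primary, fallback=None):
        self.primary = primary    # lower-level service used in normal activity
        self.fallback = fallback  # optional redundant service used in recovery

    def request(self, value):
        # Normal activity: validate the request, then try to serve it.
        if not isinstance(value, (int, float)):
            raise InterfaceException("request must be numeric")
        try:
            return self.primary(value)
        except Exception as local_error:
            # Abnormal activity: handle the local exception internally.
            return self._recover(value, local_error)

    def _recover(self, value, local_error):
        if self.fallback is not None:
            try:
                return self.fallback(value)  # error masked: normal response
            except Exception:
                pass
        # Recovery was not possible: signal a failure exception to the caller.
        raise FailureException("service unavailable") from local_error
```

A caller built in the same way can treat FailureException as one of its own
local exceptions, which gives the recursive structuring for error
confinement discussed in Section 2.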

4.3. Low-level Design and Fault Tolerance 

The low-level design phase (hereafter simply called "design") takes as
input the information collected during the requirements and architecting
phases and produces a design artefact to be used by developers for guiding
and documenting software coding. When dealing with fault tolerant systems,
the design phase needs clear and domain-specific tools and methodologies to
drive the implementation of a particular fault tolerance technique. This
introductory section presents only those approaches that treat fault
tolerance throughout the software development process (from requirements to
code).

In 42,43 two approaches are presented for moving from architectural design
to low-level design through the definition of fault tolerant design
patterns. In 33 an initial study on the CORRECT MDA approach is introduced:
given a coordinated atomic action specification, the approach enables the
automatic production of Java code. This approach has been subsequently
refined in other papers, and recently presented in 44. In 20 an approach for
fault tolerance specification and analysis during the entire development
process is proposed. It considers how to specify normal and exceptional
requirements, how to use this information for driving the component
specification and design phase, and how to implement all such information
using a Java-based framework. The proposed software process is based on the
Catalysis process. In 34 a similar strategy is adopted based on the UML
Components Process (even if this paper is oriented more towards testing).

5. Verification and Validation of Fault Tolerant Systems 

Fault tolerance techniques alone are not enough to achieve full
dependability, since unexpected faults can neither always be avoided nor
always be tolerated. 45 In addition, it is important to note that fault
tolerant systems inevitably contain faults. Verification and validation
(V&V) techniques have proved to be successful means of assuring that
expected properties and requirements are satisfied in system models and
implementation. 34 In this setting, V&V techniques are the means for
removing faults from the system.

Furthermore, V&V should be used at each life-cycle phase, since fault
tolerance engineering spans the entire software development life-cycle.

Different classes of faults, errors, and failures must be identified and 
dealt with at each phase of software development, depending on the ab- 
straction level used in modeling the software system under development. 
Thus, each abstraction level requires specific design models, implementa- 
tion schemes, verification techniques, and verification environments. 

Verification and validation techniques aim to ensure the correctness of a
software system, or at least to reduce the number or severity of faults,
both during the development and after the deployment of a system. There are
two different classes of verification methods: exhaustive methods, which
conduct an exhaustive exploration of all possible behaviours, and
non-exhaustive methods, which explore only some of the possible behaviours.
The exhaustive class includes model checking, theorem provers, term
rewriting systems, proof checker systems, and constraint solvers. The
non-exhaustive class includes testing and simulation, the veteran
techniques, which are commonly and widely used but can easily miss
significant errors when verifying large and complex systems. Several
approaches proposed in the literature in recent years try to apply V&V
techniques to fault tolerant systems; they are surveyed in the following
subsections. In particular, Section 5.1 reports the use of model checking
techniques for fault tolerant systems, Section 5.2 summarizes experiences
with theorem provers applied to fault tolerant systems, Section 5.3 shows
approaches that exploit constraint solvers for verifying fault tolerant
systems, and finally Section 5.4 reports how fault tolerance has been
complemented with testing techniques. Furthermore, with the introduction of
UML 46 as the de-facto standard for modeling software systems and its
widespread adoption in industrial contexts, many approaches have been
proposed that use UML for modeling and evaluating dependable systems
(e.g., 47-50 ); they are reported in Section 5.5.



5.1. Model Checking 

Model checkers take as input a formal model of the system, typically
described by means of state machines or transition systems, and verify
whether it satisfies temporal logic properties. 51

Several approaches focusing on model checking of fault tolerant systems
have been proposed in recent years, such as 52-54.

These papers describe approaches and show their application to real case
studies, testifying that model checking is a promising and successful
verification technique. First of all, model checking techniques are
supported by tools, which facilitates their application. Secondly, when the
verification detects a violation of a desired property, a counterexample is
produced showing how the system reaches the erroneous state in which the
property is violated.
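
The following toy example illustrates the principle on the simplest kind of
property, an invariant that must hold in every reachable state. It is a
sketch only: the two-replica model, the single_fault_assumption flag and all
function names are hypothetical, and real model checkers handle full
temporal logics and far larger state spaces.

```python
from collections import deque


def check_invariant(initial, successors, invariant):
    """Explore all reachable states; return None if `invariant` always holds,
    otherwise a shortest path (counterexample) leading to a violation."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def successors(state, single_fault_assumption):
    """State is (primary_up, backup_up); models crashes and restarts."""
    primary, backup = state
    result = []
    # Fault transitions: a running replica may crash.
    if primary and (backup or not single_fault_assumption):
        result.append((False, backup))
    if backup and (primary or not single_fault_assumption):
        result.append((primary, False))
    # Recovery transitions: a crashed replica is restarted.
    if not primary:
        result.append((True, backup))
    if not backup:
        result.append((primary, True))
    return result


def service_available(state):
    return state[0] or state[1]
```

Under the single-fault assumption the checker reports no violation; once
that assumption is dropped it returns the trace (True, True), (False, True),
(False, False), which shows how the outcome of verification depends on the
fault assumptions built into the model.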

Model checking approaches for fault tolerant systems typically require
the specification of normal behaviours, failing behaviours, and fault
recovery procedures. Thus, fault tolerant systems are subject to the state
explosion problem, which afflicts model checkers even when verifying
systems that do not consider exceptional behaviours.

One approach that can be used to mitigate the state explosion problem is
the partial model checking technique introduced in 55. This technique, which
tries to gradually remove parts of the system, has been successfully applied
to security analysis, and an attempt to use it for fault tolerant systems is
made in 56.



5.2. Theorem Provers 

Interactive theorem provers start with axioms and try to produce new
inference steps using rules of inference. They require a human user to give
hints to the system, and working on hard problems usually requires a skilled
user. A logic characterization of fault tolerance is given in, 56 while
approaches that apply theorem proving techniques to fault-tolerant systems
are in. 57-60

5.3. Constraint Solvers

Given a logical formula, expressed in a suitable logic, constraint solvers
attempt to find a model that makes the formula true. The model is typically
an assignment of values to variables. One of the most famous constraint
solvers, based on first-order logic, is the Alloy Analyzer, the verification
engine of Alloy. 61 In 62 the authors propose an approach that exploits Alloy
for modeling and formally verifying fault-tolerant distributed systems. More
precisely, they focus on systems that use exception handling as the
mechanism for fault tolerance, and in particular they consider systems
designed using Coordinated Atomic Actions (CAA). 63 CAA is a fault-tolerant
mechanism that uses concurrent exception handling and unifies the features
of two complementary concepts: conversation and transaction. Conversation 64
is a fault-tolerant technique for performing coordinated error recovery in a
set of participants that have been designed to interact with each other to
provide a specific service (cooperative concurrency).
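
The notion of "finding a model that makes the formula true" can be
illustrated with a small sketch. The code below simply enumerates all
assignments of Boolean values to variables until one satisfies the formula.
This exhaustive search is an assumption made for illustration and is not how
the Alloy Analyzer works internally; the function and variable names are
hypothetical.

```python
from itertools import product


def find_model(variables, formula):
    """Return the first assignment (dict) that satisfies formula, or None."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return assignment
    return None


def formula(m):
    # (a or b) and (not a or c) and (not b or not c)
    return (m["a"] or m["b"]) and (not m["a"] or m["c"]) and (not m["b"] or not m["c"])


model = find_model(["a", "b", "c"], formula)
no_model = find_model(["a"], lambda m: m["a"] and not m["a"])
```

When no assignment exists within the explored space, as for the second
formula, the search reports that no model was found.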

5.4. Testing 

Testing refers to the dynamic verification of a system's behaviour based on 
the observation of a selected set of controlled executions, or test cases. 65 
Testing is the main fault removal technique. 

A real-world project involving 34 independent programming teams developing
program versions of an industry-scale avionics application is presented
in. 66 Detailed experiments are reported that study the nature, source, type,
detectability, and effect of the faults uncovered in the program versions. A
new test generation technique is also presented, together with an
evaluation of its effectiveness.

Another approach 67 shows how fault tolerance and testing can be used
to validate component-based systems. Fault tolerance requirements guide
the construction of a fault-tolerant architecture, which is subsequently
validated with respect to the requirements and submitted to testing.

5.5. UML-based approaches for modeling and validating 
dependable systems 

The approaches considered in this section share the idea of translating 
design models into reliability models. 

Many solutions have been proposed in the context of the European ESPRIT
project HIDE. 49 This project aims at the creation of an integrated
environment for designing and verifying dependable systems modeled in UML.
In 48 the authors propose automatic transformations from UML specifications,
augmented with additional required information (e.g., fault occurrence rate,
percentage of permanent faults), to Petri net models.

A modular and hierarchical approach for dependable software architectures
is proposed in. 50 The language used for describing software architectures is
UML. The approach suggests a refinement process that allows critical parts
of the model to be described as information becomes available in later
design phases.

In 47 the authors convert UML models to dynamic fault trees. In this work,
UML is mainly used as a language for describing module substitution and
error propagation.



6. Languages and Frameworks 

It is very important that engineers find, in their development tools,
features that help them deal with the increase in complexity caused by
incorporating fault-tolerant software techniques into the final software.
Each development tool studied in this section helps in separating the code
that implements the software system function (as described by its functional
specification) from the code that implements service restoration (or simply
"recovery") when a deviation from the correct service is detected (by the
implemented error detection technique, of course). The choice of recovery
features depends on the classes of faults that are to be tolerated. For
example, transient faults, which are faults that eventually disappear
without any apparent intervention, can be tolerated by error handling.

Each of the tools presented below allows engineers to address this issue,
sometimes in several ways. The selection among the proposed solution paths
will depend on costs in terms of money, processing power (performance), and
memory size, but also on the costs (quantifiable or not) induced by the
failure of the software system. The latter is the most important cost to
evaluate precisely in order to decide on the requirements for fault
tolerance.

This section addresses three types of development environments: programming
languages, fault-tolerant frameworks and advanced fault-tolerant
frameworks. The choice among these three categories will depend on the
complexity of the fault tolerance requirements.

6.1. Programming Languages Perspectives

Some programming languages incorporate fault tolerance techniques directly
as part of their syntax, or indirectly through features that allow engineers
to implement them. One reason for having fault tolerance support at the
programming language level is the increased performance that results from
exploiting application-specific knowledge. Another reason is to offer
programmers who need to comply with standard programming languages the
ability to design and develop fault-tolerant applications more easily.

6.1.1. Exception Handling 

As stated in the previous section, one of the features to be provided by a
fault tolerance technique is support for separating the fault tolerance
instructions (for recovery objectives) from the rest of the software and for
activating them automatically when necessary. The obvious moment to activate
the recovery behaviour is when it is impossible to finish the operation that
the software is carrying out. An exception is defined precisely as the
notification of the impossibility of finishing an operation, and it can be
used to know that the software is going to fail if no action is taken. The
period of time between the notification of the exception and the failure of
the software can therefore be used to apply the fault-tolerant instructions
in order to keep the software running and avoid a failure. This is called
exception handling (EH). It is the most popular approach in modern software
and it plays a vital part in the implementation of fault tolerance in
software systems.

Nowadays, various exception handling models are part of practical
programming languages like Ada, C++, Eiffel, Java, ML and Smalltalk. Almost
all languages have similar basic types of exceptions and similar constructs
to introduce new exception types and to handle exceptions (e.g.
try/throw/catch in Java). Thus, exception handling is a good technique to
implement fault-tolerant sequential programs.
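
The separation between functional code and recovery code can be sketched as
follows. The example uses Python, whose try/raise/except constructs play the
role of try/throw/catch in Java. The exception type, the function names and
the fallback strategy (returning the last known value) are assumptions made
for illustration.

```python
class SensorUnavailable(Exception):
    """Hypothetical application-specific exception type."""


def read_sensor(readings):
    """Functional code: signals that the operation cannot be finished."""
    if not readings:
        raise SensorUnavailable("no reading available")
    return readings.pop(0)


def current_temperature(readings, last_known):
    try:
        # Normal behaviour: no recovery logic is mixed in here.
        return read_sensor(readings)
    except SensorUnavailable:
        # Recovery behaviour: activated automatically by the exception.
        return last_known


fresh = current_temperature([21.5], last_known=20.0)
fallback = current_temperature([], last_known=20.0)
```

The handler runs only when the exception is raised, so the recovery
instructions stay out of the normal control flow.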

In the context of distributed concurrent software (a network of computing
nodes), the situation is different. Exception handling must be defined
according to the semantics of concurrency and distribution. Erlang 68 is a
declarative language for programming concurrent and distributed software
systems with EH features. This language has primitives that support the
creation of processes (separate units of computation), the communication
between processes over a network of nodes, and the handling of errors when a
failure causes a process to terminate abnormally. The spawn statement allows
a process to create a new process on a remote node. Whenever a new process
is created, it belongs to the same process group as the process that
evaluated the spawn statement. Once a process terminates its execution
(normally or abnormally), a special signal is sent to all processes that
belong to the same group. The value of one of the parameters of the signal
is used to detect whether the process terminated abnormally. In order to
avoid propagating an abnormal signal to the other processes of the group
(i.e. to ensure failure containment), the default behaviour must be changed.
This is achieved by using the catch instruction, which defines a scope for
dealing with errors that occur in the monitored expression.
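
The propagation and containment of termination signals described above can
be simulated in a few lines. The sketch below is written in Python rather
than Erlang and does not use any Erlang primitive: the `Process` class, the
`traps_exits` flag (standing in for "changing the default behaviour") and
the signal format are all assumptions made for illustration.

```python
class Process:
    """Simulated process that belongs to a group (a shared list)."""

    def __init__(self, name, group, traps_exits=False):
        self.name = name
        self.group = group
        self.traps_exits = traps_exits  # hypothetical containment switch
        self.alive = True
        self.received = []              # signals handled instead of propagated
        group.append(self)

    def terminate(self, reason="normal"):
        if not self.alive:
            return
        self.alive = False
        # A signal carrying the termination reason goes to every group member.
        for peer in self.group:
            if peer is not self and peer.alive:
                peer.on_exit_signal(self.name, reason)

    def on_exit_signal(self, sender, reason):
        if self.traps_exits:
            # Containment: the signal is recorded and can trigger recovery.
            self.received.append((sender, reason))
        elif reason != "normal":
            # Default behaviour: an abnormal signal terminates the receiver.
            self.terminate(reason)


group = []
supervisor = Process("supervisor", group, traps_exits=True)
worker_a = Process("worker_a", group)
worker_b = Process("worker_b", group)
worker_a.terminate("crash")
```

In this run the abnormal termination of one worker takes the other worker
down with it, while the process that changed the default behaviour survives
and learns about both terminations.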



6.1.2. Atomic Actions 

Fault tolerance in distributed concurrent software systems can also be 
achieved using the general concept of atomic actions. A group of com- 
ponents (participants, threads, processes, objects, etc.) that cooperate to 
achieve a joint goal, without information flow between the group and the 
rest of the system for the period necessary to achieve the goal, consti- 
tutes an atomic action. These components are designed to cooperate inside 
the action, so that they share work and exchange information in order 
to complete the action successfully. Atomicity guarantees that if the action
is successfully executed, then its results and modifications to shared data
become visible to other actions. If an error is detected, however, all the
components take part in a cooperative recovery in order to return without
changes to the shared data. These characteristics allow containment and
recovery to be achieved easily, since error detection, propagation and
recovery all occur within a single atomic action. Fault tolerance steps can
therefore be attached to each of the atomic actions that form part of the
software, independently of each other. The first fault-tolerant atomic
action scheme proposed was the conversation scheme. 64 It allows design
faults to be tolerated by making use of software diversity and participant
rollback. Other schemes that include fault tolerance have been proposed
since then and developed as part of programming languages. For example,
Avalon 69 takes advantage of inheritance to implement atomic actions in
distributed object-oriented applications. Avalon relies on the Camelot 70
system to handle operating-system-level details. Much of the Avalon design
has been inspired by Argus. 71
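
The all-or-nothing effect of an atomic action on shared data can be sketched
as follows. This is a deliberately simplified, single-threaded illustration:
it restores a snapshot when an error is detected, and it models neither
several cooperating participants nor isolation from concurrent actions. The
`atomic_action` helper and the account data are hypothetical.

```python
import copy
from contextlib import contextmanager


@contextmanager
def atomic_action(shared):
    """Run a block so that shared is left unchanged if the block fails."""
    snapshot = copy.deepcopy(shared)
    try:
        yield shared
    except Exception:
        # Recovery: roll the shared data back to its state on entry.
        shared.clear()
        shared.update(snapshot)
        raise


accounts = {"a": 100, "b": 0}

# Successful action: its modifications become visible afterwards.
with atomic_action(accounts) as data:
    data["a"] -= 30
    data["b"] += 30

# Failing action: the partial modification is undone.
try:
    with atomic_action(accounts) as data:
        data["a"] -= 500
        if data["a"] < 0:
            raise ValueError("insufficient funds")
        data["b"] += 500
except ValueError:
    pass
```

After both blocks the shared data reflects only the successful action.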



November 9, 2010 3:29 



WSPC - Proceedings Trim Size: 9in x 6in SEFT-HM-070213 




6.1.3. Reflection and Aspect-Orientation

Another software technology considered for handling software faults, and
which is related to programming languages, is "reflection". 72 Reflection is
the ability of a computational system to observe its own execution and, as a
result of that observation, possibly make changes to that execution.
Software based on reflective facilities is structured into different levels:
the base level and one or more metalevels. Everything in the implementation
and application (the syntax, the semantics, and the run-time data
structures) is "open" to the programmer for modification via the
metalevels. 73 The metalevels can be used to handle the fault tolerance
strategies. This layered structure therefore allows programmers to separate
the recovery steps (part of the metalevels) from the steps necessary to
reach the functional goal (part of the base level). Because the metalevels
can observe the base-level computation, its execution can be halted when any
deviation is observed (according to some reference parameter) in order to
start the recovery.
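
A minimal sketch of this base-level/metalevel separation is given below. It
relies on Python's attribute interception (`__getattr__`) to let a
metaobject observe every method call on a base-level object and start a
recovery when the result deviates from an acceptance criterion. The class
names, the acceptance range and the recovery value are assumptions made for
illustration.

```python
class MetaObject:
    """Metalevel: intercepts method calls made on the base-level object."""

    def __init__(self, base, is_acceptable, recover):
        self._base = base
        self._is_acceptable = is_acceptable
        self._recover = recover

    def __getattr__(self, name):
        # Called only for names not found on the metaobject itself.
        attribute = getattr(self._base, name)
        if not callable(attribute):
            return attribute

        def intercepted(*args, **kwargs):
            result = attribute(*args, **kwargs)
            if not self._is_acceptable(result):
                # Deviation observed: switch to the recovery steps.
                return self._recover(name, args, kwargs)
            return result

        return intercepted


class Thermostat:
    """Base level: contains functional code only."""

    def read(self, raw):
        return raw / 10.0


guarded = MetaObject(
    Thermostat(),
    is_acceptable=lambda value: -50.0 <= value <= 150.0,
    recover=lambda name, args, kwargs: 20.0,  # hypothetical safe default
)
plausible = guarded.read(215)
recovered = guarded.read(99999)
```

The base-level class contains no fault tolerance code at all; the
acceptance check and the recovery live entirely in the metalevel.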

Generic solutions for implementing fault tolerance that are, or can be,
chosen for many programming languages consist in:

• extending the programming language with non-standard constructs 
and semantics; 

• extending the implementation environment underlying the pro- 
gramming language to provide the functionality, but with an inter- 
face expressed using existing language constructs and semantics; 

• extending the language with specific abstractions, implemented
with existing language constructs and semantics (e.g., abstract data
types intended to support software fault tolerance), perhaps expressed
in well-recognised design patterns;

• or combinations of the previous approaches. 

Aspect-orientation has been accepted as a powerful technique for
modularizing crosscutting concerns during software development in so-called
aspects. Similar to reflection, aspect-oriented techniques provide means to
extend a base program with additional state and to define additional
behaviour that is to be triggered at well-defined points during the
execution of the program. Experience has shown that aspect-oriented
programming is successful in modularizing even very application-independent,
general concerns such as distribution and concurrency. Examples of using
aspect-orientation to achieve fault tolerance are given in. 74-78
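
As a rough approximation of this idea, the sketch below uses a Python
decorator as an "aspect" whose advice wraps a base function and retries it
when a given exception occurs. Decorators are not a full aspect-oriented
language (there is no pointcut language here), and the names `retry_on` and
`fetch`, as well as the retry policy, are assumptions made for illustration.

```python
import functools


def retry_on(exception_type, attempts):
    """Build an aspect that retries the wrapped call on exception_type."""

    def aspect(func):
        @functools.wraps(func)
        def advice(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exception_type:
                    if attempt == attempts:
                        raise  # give up after the last attempt

        return advice

    return aspect


calls = {"n": 0}  # additional state used only to simulate a transient fault


@retry_on(ConnectionError, attempts=3)
def fetch():
    """Base program: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "payload"


result = fetch()
```

The retry concern is written once and can be attached to any function,
which is the modularization benefit the paragraph above refers to.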

Nevertheless, if an application has complex fault tolerance requirements,
relying on programming language features alone will be risky, and the
selection of a framework will then be necessary.

6.2. Frameworks for Fault Tolerance

According to 79 "a framework is a reusable design expressed as a set of 
abstract classes and the way their instances collaborate. It is a reusable 
design for all or part of a software system. By definition, a framework is 
an object-oriented design. It doesn't have to be implemented in an object- 
oriented language, though it usually is. Large-scale reuse of object-oriented 
libraries requires frameworks. The framework provides a context for the 
components in the library to be reused."

CORBA (Common Object Request Broker Architecture) is a good example of a
framework; it was conceived to provide application interoperability.
Unfortunately, CORBA and other traditional frameworks often cannot meet the
demanding quality of service (QoS) requirements (including the
fault-tolerance ones) of certain specialised applications. This is why these
frameworks are often extended to include fault tolerance techniques in order
to become predictable and reliable. This is done in the FT CORBA
specification. 80 It defines an architecture, a set of services, and
associated fault tolerance mechanisms that constitute a framework for
resilient, highly-available, distributed software systems. Fault tolerance
is achieved by features that allow designers to replicate objects in a
transparent way. The set of replicas of a specific object defines an object
group. A client object is not aware that the server object is replicated
(server object group). The client object therefore invokes methods on the
server object group as it would on a conventional object, and the members of
the server object group execute the methods and return their responses to
the client.
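
The transparency of an object group can be sketched as follows. The code is
not the FT CORBA API; it is a local, in-process illustration in which a
proxy forwards each invocation to all replicas and returns the first
response obtained. The class names and the failure model (a replica raises
an exception when it is down) are assumptions made for illustration.

```python
class ReplicaFailure(Exception):
    """Hypothetical error raised by a replica that cannot serve a request."""


class Counter:
    """Server object; healthy=False simulates a crashed replica."""

    def __init__(self, healthy=True):
        self.value = 0
        self.healthy = healthy

    def increment(self):
        if not self.healthy:
            raise ReplicaFailure("replica down")
        self.value += 1
        return self.value


class ObjectGroup:
    """Proxy that looks like a single object to the client."""

    def __init__(self, replicas):
        self._replicas = list(replicas)

    def __getattr__(self, name):
        def invoke(*args, **kwargs):
            responses = []
            for replica in self._replicas:
                try:
                    responses.append(getattr(replica, name)(*args, **kwargs))
                except ReplicaFailure:
                    continue  # mask the failure of this replica
            if not responses:
                raise ReplicaFailure("all replicas failed")
            return responses[0]

        return invoke


server = ObjectGroup([Counter(healthy=False), Counter(), Counter()])
first = server.increment()
second = server.increment()
```

The client code calls `server.increment()` exactly as it would on a single
`Counter`, and the failed replica is masked by the remaining ones.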

The same approach has also been followed at higher levels of abstraction.
For example, the well-known coordination language Linda 81 has been extended
to facilitate the programming of fault-tolerant parallel applications.
FT-Linda, 82 for instance, is an extension of the original Linda model that
defines new concepts such as stable tuple spaces (stable TSs) and failure
tuples, and new syntax to achieve atomic execution of a series of TS
operations. A stable TS is a tuple space whose tuples survive a failure.
This stability is achieved by replicating the tuples on the multiple hosts
that make up the distributed environment where the software is deployed.
FT-Linda uses monitoring to detect failures. The main type of failure
addressed by FT-Linda is host failure, which means that the host has been
silent for longer than a pre-defined interval. When such a failure is
detected, the framework automatically notifies all processes by depositing a
failure tuple in a stable TS available to them. Note that there must exist a
specific process in the software in charge of watching for a failure tuple
and starting the corresponding recovery process. Atomic execution is denoted
by angle brackets and represents the all-or-none execution of the group of
tuple space operations enclosed by the angle brackets. There are also some
language primitives that allow tuples to be moved or copied from one TS to
another.
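
The two central ideas, all-or-none execution of a group of tuple space
operations and notification through a failure tuple, can be sketched with a
small in-memory tuple space. This is not FT-Linda syntax and it omits
replication across hosts entirely; the `TupleSpace` class, its method names
and the tuple layouts are assumptions made for illustration.

```python
class TupleSpace:
    """In-memory tuple space; None in a pattern acts as a wildcard."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        """Deposit a tuple."""
        self._tuples.append(tup)

    def rd(self, pattern):
        """Return the first matching tuple without removing it, or None."""
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == field for p, field in zip(pattern, tup)
            ):
                return tup
        return None

    def take(self, pattern):
        """Remove and return the first matching tuple (non-blocking)."""
        tup = self.rd(pattern)
        if tup is None:
            raise LookupError(f"no tuple matches {pattern!r}")
        self._tuples.remove(tup)
        return tup

    def atomic(self, operations):
        """Apply all operations, or none of them if any one fails."""
        snapshot = list(self._tuples)
        try:
            return [operation(self) for operation in operations]
        except Exception:
            self._tuples = snapshot
            raise


ts = TupleSpace()
ts.out(("task", 1))

# The second operation fails, so the first removal is undone as well.
try:
    ts.atomic([lambda s: s.take(("task", None)), lambda s: s.take(("result", None))])
except LookupError:
    pass
task_after_abort = ts.rd(("task", None))

# A monitor would deposit a failure tuple; a recovery process watches for it.
ts.out(("failure", "host-2"))
failure = ts.rd(("failure", None))
```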

Another extension of Linda, oriented towards mobile applications using the
agent paradigm, is CAMA. 83 It brings fault tolerance to mobile agent
applications by means of a novel exception handling mechanism developed for
this type of application. This exception handling mechanism consists in
attaching application-specified handlers to agents to achieve recovery.
Therefore, to the set of primitives derived from Linda (e.g. create, delete,
put, get, etc.), CAMA adds some primitives to catch and raise inter-agent
exceptions (e.g. raise, check, wait).

Linda has strongly influenced the design of the JavaSpaces 84 system. They
are similar in the sense that both store collections of information
(resources) that can later be searched by value-based lookup in order to
allow members of distributed software systems to exchange information.
JavaSpaces is part of Jini Network Technology, 85 which is an open
architecture that enables developers to create network-centric services.
Jini has a basic mechanism for fault-tolerant resource control, which is
referred to as a lease. This mechanism is used to define the interval of
time for which a resource is held by the party that requests access to it.
The mechanism signals (error detection) when a lease expires, so that
actions (recovery) can be taken on lease expiration on behalf of the
requesting party.
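
The lease idea can be sketched independently of Jini as follows. The code
does not use the Jini API; the `Lease` and `LeaseTable` classes, the
explicit `now` parameter (used instead of a real clock to keep the example
deterministic) and the callback are assumptions made for illustration.

```python
class Lease:
    """Time-bounded hold on a resource."""

    def __init__(self, resource, holder, granted_at, duration):
        self.resource = resource
        self.holder = holder
        self.expires_at = granted_at + duration

    def expired(self, now):
        return now >= self.expires_at

    def renew(self, now, duration):
        if self.expired(now):
            raise ValueError("cannot renew an expired lease")
        self.expires_at = now + duration


class LeaseTable:
    """Grants leases and reclaims resources whose lease has expired."""

    def __init__(self):
        self._leases = []

    def grant(self, resource, holder, now, duration):
        lease = Lease(resource, holder, now, duration)
        self._leases.append(lease)
        return lease

    def reap(self, now, on_expired):
        """Detect expired leases and hand each one to the recovery callback."""
        active = []
        for lease in self._leases:
            if lease.expired(now):
                on_expired(lease)
            else:
                active.append(lease)
        self._leases = active


table = LeaseTable()
printer = table.grant("printer", "client-a", now=0, duration=10)
scanner = table.grant("scanner", "client-b", now=0, duration=10)
scanner.renew(now=8, duration=10)  # a live holder keeps renewing

reclaimed = []
table.reap(now=15, on_expired=lambda lease: reclaimed.append(lease.resource))
```

A holder that has crashed simply stops renewing, so its resource is
reclaimed without any explicit failure message being required.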

Even though these framework extensions are good tools for implementing
fault tolerance, they are still "extensions" of existing tools. The next
section presents frameworks that have been designed with the central
objective of supporting the design and implementation of fault tolerance.

6.3. Advanced Frameworks for Fault Tolerance 

Frameworks that have been defined to allow designers/programmers to de- 
velop fault-tolerant software by the implementation of fault tolerance tech- 
niques share, in general, the following characteristics: 86 

• many details of their implementation are made transparent to the 
programmers; 




• they provide well-defined interfaces for the definition and imple- 
mentation of fault tolerance techniques; 

• they are recursive in nature (each component can be seen as a 
system itself). 

Arjuna, 87 programmed in C++ and Java, is one such framework. Arjuna permits
the construction of reliable distributed applications in a relatively
transparent manner. Reliability is achieved through the provision of
traditional atomic transaction mechanisms implemented using only standard
language features. It provides the programmer with basic capabilities to
handle recovery, persistence and concurrency control, while at the same time
giving flexibility to the software by allowing those capabilities to be
refined as required by the demands of the application.

Another similar work is OPTIMA, 88 an object-oriented framework (developed
for Ada, Java and AspectJ) that provides the necessary runtime support for
the "Open Multithreaded Transactions" (OMT) model. OMT is a transaction
model that provides features for controlling and structuring not only
accesses to objects, as usually happens in transaction systems, but also
the threads taking part in the same transaction in order to perform a joint
activity. The framework provides features to fork and terminate threads
inside a transaction, while restricting their behaviour in order to
guarantee correctness of transaction nesting and isolation among
transactions.

The DRIP framework 89 provides technological support for implementing DIP 90
models in Java. DIP combines Multiparty Interactions 91,92 and Exception
Handling 93 in order to support the design of dependable interactions among
several processes. As an extension of DRIP, the CAA-DRIP implementation
framework 44 provides a way to execute Java programs that have been designed
using the COALA design language, 94 which exploits the Coordinated Atomic
actions (CA actions) concept. 95

Other advanced frameworks for fault tolerance are currently being designed.
For example, the MetaSolve design framework, supporting dynamic selection
and parallel composition of services, has been proposed for developing
dependable systems. 96 This approach defines an architectural model that
makes use of service-oriented architecture features to implement fault
tolerance techniques based on metadata. Furthermore, software implemented
using the MetaSolve approach will be able to adapt itself at run-time in
order to provide dynamic fault tolerance. The ability of such software to
respond dynamically to potentially damaging changes by adaptation, in order
to maintain an acceptable level of service, is referred to as dynamic
resilience. The architecture of software that follows this approach relies
on dynamic information about software components in order to make decisions
for dynamic reconfiguration. Such metadata is then used in accordance with
resilience policies ensuring that the desired dependability requirements are
met.

7. Contribution of this Book to the Topic 

This book will contribute to the overall topic of Software Engineering and 
Fault Tolerance with nine papers, briefly described in the following, and 
categorized according to the three parts identified at the beginning of this 
paper: 

• Part A: Fault tolerance engineering: from requirements to code 

— In "Exploiting Reflection to Enable Scalable and Performant 
Database Replication at the Middleware Level" Jorge Solas, 
Ricardo Jimenez-Peris, Maria Patino-Martinez and Bettina 
Kemme introduce a design pattern for database replication
using reflection at the interface level. It permits a clear separation
between regular function and replication logic. This design
pattern yields good performance and scalability properties.

— In "Adding Fault-Tolerance to State Machine-Based Designs",
Sandeep S. Kulkarni, Anish Arora and Ali Ebnenasir present a
non-application-specific approach for the automatic re-engineering
of code in order to make it fault-tolerant and safe. This
generic approach uses model-based transformations of programs
that must be atomic with respect to accesses to variables.

— In "Replication in Service-Oriented Systems", Johannes Osrael,
Lorenz Froihofer and Karl M. Goeschka present the state of the art of
replication protocols and of replication in service-oriented
architectures supported by middleware. The paper then shows how to
enhance the existing solutions provided in the service-oriented field
with already known designs for replication in traditional systems.

• Part B: Verification and validation of fault tolerant systems 

— In "Embedded Software Validation Using On-Chip Debugging
Mechanisms", Juan Pardo, Jose Carlos Campelo, Juan Carlos
Ruiz and Pedro Gil present how to use on-chip debugging in practice
to perform fault injection in a non-intrusive way. This
portable approach offers a verification and validation means for
checking and validating the robustness of COTS-based embedded
systems.

— In "Error Detection in Control Flow of Event-Driven State 
Based Applications", Gergely Pinter and Istvan Majzik 
present a formal approach using state-charts to detect two
classes of faults: those made during state-chart refinement
(using temporal logic model checking) and those made during
implementation (using model-based testing).

— In "Fault-Tolerant Communication for Distributed Embedded
Systems", Christian Kühnel and Maria Spichkova present a
formal specification of FlexRay and FTCom using the FOCUS formal
framework. It thus provides a precise semantics useful for
analyzing dependencies and for the verification (using
Isabelle/HOL) of existing implementations.

• Part C: Languages and Tools for engineering fault tolerant systems 

— In "A Model Driven Exception Management Framework", Susan
Entwisle and Elizabeth Kendall present a model-driven
engineering approach to the engineering of fault-tolerant systems.
An iterative development process using the UML 2 modelling
language and model transformations is proposed. The
engineering framework proposes generic transformations for
exception handling strategies, thus raising exception handling
to a higher level of abstraction than implementation alone.

— In "Runtime Failure Detection and Adaptive Repair for Fault- 
Tolerant Component-Based Applications", Rong Su, Michel 
Chaudron and Johan Lukkien formally present a fault management
mechanism useful for systems designed using a component model.
Run-time failures are detected and the repair strategy is selected
using a rule-based approach. An adaptive technique is proposed to
progressively improve the selection of the repair strategy. A
forthcoming development framework, Robocop, will provide an
implementation of this mechanism.

— In "Extending the Applicability of the Neko Framework for 
the Qualitative and Quantitative Validation and Verification 
of Distributed Algorithms", Lorenzo Falai and Andrea Bondavalli
present a development framework written in Java allowing
rapid prototyping of Java distributed algorithms. An
import function allows for direct integration of C and C++
programs via glue code. The framework offers some techniques
for qualitative analysis that can be used specifically for the
fault tolerance parts of the distributed program developed
with the framework.